Important Note: Grading is based both on your graphs and verbal explanations. Follow all best practices as discussed in class, including choosing appropriate parameters for all graphs, and appropriate color choices. There’s one exception: only one of your graphs needs to be drawn with color vision deficient friendly colors. Choose one, and show that that graph passes the test by including a screenshot taken with a color vision deficiency simulator such as Color Oracle (http://colororacle.org/).
All graphs must be drawn in R. I have provided package suggestions but you may choose other options.
You are expected to develop a basic subject matter understanding of the data, looking up, for example, whether a variable is ordinal or nominal, what the units are, etc. as relevant.
Do not expect the assignment questions to spell out precisely how the graphs should be drawn. Sometimes guidance will be provided, but the absense of guidance does not mean that all choices are ok.
Read Graphical Data Analysis with R, Ch. 6, 7
library(EDAWR)
library(GGally)
library(parcoords)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(ggalluvial)
library(vcd)
library(ggalluvial)
library(tidyr)
[6 points]
Data: nutrition dataset in EDAWR package. Install with:
remotes::install_github(rstudio/EDAWR)
Package suggestions: GGally, parcoords
group column) and at least 8 numeric variables and create a static parallel coordinates plot in which each line represents a food item. Color by food group. You may random sample if you find there is too much data to reasonably display. Choose parameters that best show trends in the data: experiment with alpha blending, scale, order of columns, splines (if available), and size of sample.data_new <- nutrition %>% select(c("zinc","iron","calories","potassium","protein","niacin","magnesium","carbohydrates","group"))
data_new <- data_new[sample(1:nrow(data_new),3000),]
data_new <- filter(data_new, group %in% c("Poultry Products","Cereal Grains and Pasta","Beef Products"))
ggparcoord(data_new,columns = 1:8 ,groupColumn = "group", scale = "uniminmax", alphaLines = 0.3, splineFactor = 0, order = "skewness") +
theme(legend.position = "none")
#为啥要把legend隐藏
data2 = rbind(data_new[3,],data_new[-3,])
parcoords(
data2,
rownames = FALSE
,brushMode = "1d-multi"
,reorderable = TRUE,
color=list(colorBy = "group"),
withD3 = TRUE
)
#我改了一下颜色,为了第三问能方便看一点
[9 points]
Data: NYSERDA Electric Vehicle Drive Clean Rebate Data: Beginning 2017
Package suggestions:
Mosaic plots: vcd or ggmosaic
Treemap: treemap
From the NYS web site: “New York State’s Charge NY initiative offers electric car buyers the Drive Clean Rebate of up to $2,000 for new car purchases or leases. The rebate amount depends on the battery-only range of each vehicle. Dealers enrolled in the program deduct the eligible amount from the vehicle price at the point of sale and then submit a rebate application with NYSERDA. This dataset includes all completed rebate applications as of the data through date.”
https://data.ny.gov/Energy-Environment/NYSERDA-Electric-Vehicle-Drive-Clean-Rebate-Data-B/thd2-fu8y
You can read the data directly from the site with:
read_csv("https://data.ny.gov/resource/thd2-fu8y.csv")
Draw the following graphs and answer the questions.
data2<-read_csv("https://data.ny.gov/resource/thd2-fu8y.csv")
mosaic(rebate_amount_usd~transaction_type,data2)
There is an association between transaction type and rebate amount because the distributions of rebate amount are obviously different with different types of transaction.
test <- chisq.test(table(data2$transaction_type, data2$rebate_amount_usd))
test
##
## Pearson's Chi-squared test
##
## data: table(data2$transaction_type, data2$rebate_amount_usd)
## X-squared = 69.463, df = 3, p-value = 5.562e-15
The p-value of chi square test is too small to accept the null hypothesis, and so we will assume type of transaction and rebate amount are related. It is small with result in (a).
top5 = arrange(plyr::count(data2, "make"),desc(freq))[1:5,1]
top5data = data2[data2$make %in% top5,]
top5data = mutate(top5data, year = format(as.Date(top5data$submitted_date,format="%Y-%m-%d"), "%Y"))
df2c = top5data %>% group_by(make,year) %>% dplyr::summarise(Freq = n())
vcd::mosaic(year ~ make,direction = c("v","h"),labeling = labeling_border(rot_labels = c(0,0,0,45),abbreviate_labs=c(3,2)),top5data)
d) Add
transaction_type to your plot from part c). What new information do you learn?
mosaic(year ~ make+transaction_type,direction = c("v","v","h"),labeling = labeling_border(rot_labels = c(0,0,0,45),abbreviate_labs=c(3,1,2)),top5data)
e) Use the base
pairs() function to draw a mosaic pairs plot of ev_type, transaction_type, rebate_amount_usd and year. Based on the plot, list all pairs of variables from strongest association to weakest association. (Note: The vcd package must be loaded for pairs() to find the correct method.)
library(vcd)
data2_e = mutate(data2, year = str_sub(data2$submitted_date,1,4))[,c(7,8,11,12)]
pairs(xtabs( ~ ev_type+rebate_amount_usd+transaction_type+year, data = data2_e))
library(treemap)
data2_f<-group_by (data2,make,model)%>% dplyr::summarise(count=n(),.groups = 'drop')
data2_test<-group_by(data2,make)%>%dplyr::summarise(count=n())
treemap(data2_f,index=c("make","model"), vSize="count",vColor = "count",title='treemap of make and models',palette='Set3')
[6 points]
Data: Yamaguchi87 in the vcdExtra package
Package suggestion: ggalluvial
library(vcdExtra)
library(ggalluvial)
df3 = Yamaguchi87
ggplot(df3,aes(y=Freq,axis1 = Father, axis2 = Son))+
geom_alluvium(aes(fill = Father),width = 1/12)+
geom_stratum(width = 1/12,fill = "grey")+
geom_label(stat = "stratum",aes(label = after_stat(stratum)))+
#geom_lode(stat = "stratum",aes(label = after_stat(stratum)))+
scale_x_discrete(expand = c(.05, .05))+
scale_fill_brewer(type = "qual", palette = "Set1")+ggtitle("upward / downward mobility by class")+
theme_minimal()
ggplot(df3,aes(y=Freq,axis1 = Father, axis2 = Country,axis3 = Son))+
geom_flow(aes(fill = Father),width = 1/12)+
geom_stratum(width = 1/12,fill = "grey")+
geom_label(stat = "stratum",aes(label = after_stat(stratum)))+
scale_x_discrete(expand = c(.05, .05))+
scale_fill_brewer(type = "qual", palette = "Set1")
df3_US <-filter(df3,Country=="US")
df3_UK <-filter(df3,Country=="UK")
df3_JA <-filter(df3,Country=="Japan")
plotUS<-ggplot(df3_US,aes(y=Freq,axis1 = Father, axis2 = Son))+
geom_alluvium(aes(fill = Father),width = 1/12)+
geom_stratum(width = 1/12,fill = "grey")+
geom_label(stat = "stratum",aes(label = after_stat(stratum)))+
#geom_lode(stat = "stratum",aes(label = after_stat(stratum)))+
scale_x_discrete(expand = c(.05, .05))+
scale_fill_brewer(type = "qual", palette = "Set1")+ggtitle("upward / downward mobility by class(US)")+
theme_minimal()
plotUK<-ggplot(df3_UK,aes(y=Freq,axis1 = Father, axis2 = Son))+
geom_alluvium(aes(fill = Father),width = 1/12)+
geom_stratum(width = 1/12,fill = "grey")+
geom_label(stat = "stratum",aes(label = after_stat(stratum)))+
#geom_lode(stat = "stratum",aes(label = after_stat(stratum)))+
scale_x_discrete(expand = c(.05, .05))+
scale_fill_brewer(type = "qual", palette = "Set1")+ggtitle("upward / downward mobility by class(UK)")+
theme_minimal()
plotJA<-ggplot(df3_JA,aes(y=Freq,axis1 = Father, axis2 = Son))+
geom_alluvium(aes(fill = Father),width = 1/12)+
geom_stratum(width = 1/12,fill = "grey")+
geom_label(stat = "stratum",aes(label = after_stat(stratum)))+
#geom_lode(stat = "stratum",aes(label = after_stat(stratum)))+
scale_x_discrete(expand = c(.05, .05))+
scale_fill_brewer(type = "qual", palette = "Set1")+ggtitle("upward / downward mobility by class(Japan)")+
theme_minimal()
#library("Rmisc")
#library("plyr")
#multiplot(plotUS,plotUK,plotJA, cols=1)
plotUS
plotUK
plotJA
### 4. State Credit Ratings
[9 points]
Data: state_ratings.csv in the data folder of CourseWorks
Package suggestion: ggalluvial
The original data source is here: https://www.spglobal.com/ratings/en/research/articles/190319-history-of-u-s-state-ratings-2185306
The issuer credit rating (ICR) indication, distinguishes some ratings from general obligation debt ratings. A footnote on this site explains that a state’s issue credit rating is listed in place of a general obligation bond rating when a state does not have general obligation debt outstanding. To be consistent with sites like this, and for the purposes of simplification, the ICR indication was removed.
The dates were converted to years, and in cases in which there were more than one entry per year, the latest was selected. For years in which there were no credit rating changes, the latest credit rating available was added. So for example, Wyoming had a listing of AA+ for 2017 which was reduced to AA in 2020. After filling in the inbetween years, we have the following for Wyoming from 2017-2020:
df4 <- readr::read_csv("state_ratings.csv")
#dplyr::filter(df, State == "Wyoming" & Year >= 2017)
data4 <- readr::read_csv("state_ratings.csv")
If you’re interested, the script to read and transform the data is available here: get_ratings_data.R
df4a = df4[,c(1,2,5)]
df4a = df4a %>% pivot_wider(names_from = Year, values_from = Rating) %>% data.frame()
df4a = df4a %>% dplyr::mutate(Freq = n())
df4a = df4a %>% filter(State %in% c("Arizona","California","Georgia","Florida","Illinois","
New York","Washington","Kentucky","Massachusetts","Michigan","Nevada","Ohio"))
ggplot(df4a,aes(y=Freq,axis1 = X2006, axis2 = X2007,axis3 = X2008,axis4 = X2009,axis5 = X2010,axis6 = X2011,axis7 = X2012,axis8 = X2013,axis9 = X2014,axis10 = X2015,axis11 = X2016))+
geom_alluvium(aes(fill = State),width = 1/12)+
geom_stratum(width = 1/12,fill = "grey")+
geom_label(stat = "stratum",aes(label = after_stat(stratum)))+
scale_x_discrete(expand = c(.05, .05))
state_list<-c("Alaska","California","Connecticut" ,"Illinois","Louisiana","New Jersey","Oregon" ,"Pennsylvania","South Dakota","Texas","Wyoming")
df4<-data4
df4a = df4[,c(1,2,5)]
df4a = df4a %>% pivot_wider(names_from = Year, values_from = Rating) %>% data.frame()
df4a = df4a %>% dplyr::mutate(Freq = n())
df4a = df4a %>% filter(State %in% state_list)
ggplot(df4a,aes(y=Freq,axis1 = X2006, axis2 = X2007,axis3 = X2008,axis4 = X2009,axis5 = X2010,axis6 = X2011,axis7 = X2012,axis8 = X2013,axis9 = X2014,axis10 = X2015,axis11 = X2016))+
geom_alluvium(aes(fill = State),width = 1/12)+
geom_stratum(width = 1/12,fill = "grey")+
geom_label(stat = "stratum",aes(label = after_stat(stratum)))+
scale_x_discrete(expand = c(.05, .05))
b) Create a heatmap with the same data (if applicable use the same subset or sample), also showing rating changes by state over time. (One option:
geom_tile().)
df4 %>% filter(State %in% c("Arizona","California","Georgia","Florida","Illinois","
New York","Washington","Kentucky","Massachusetts","Michigan","Nevada","Ohio")) %>% ggplot(aes(x=Year, y=State, fill=Rating))+
geom_tile()
# Heatmap2
df4 %>% filter(State %in% state_list) %>% ggplot(aes(x=Year, y=State, fill=Rating))+
geom_tile()
In heatmap, the ratings of each year for each state are on single line. The color changes alone with rating which is greatly obvious compared to alluvial diagram. Since the rating set and proportion differ for each year, there may be a curve in alluvial diagram even the value doesn’t change by year.
In alluvial diagram, the magnitude of the change are more likely to know. Since rating is a hierarchical value,more than if changes existing, we are also curious about level of changes. The alluvial diagram can reflect degree of change by curvature(length) of flow while heatmap not.